{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cross-validation\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training/validation/test data sets\n",
    "\n",
    "* **Training set**: the data set for training your models.\n",
    "* **Validation set**: The data set used for testing the performance of your models you have built with training sets. Based on the performance, you choose the best model (final).\n",
    "* **Test set**: use this data set to test the performance of your final model.\n",
    "\n",
    "## K-folds cross validation steps (k=4 as an example).\n",
    "\n",
    "* step 1: split your data into training set and test set (for example 80% training and 20% test). Test set will never be used in model training and selection. \n",
    "* step 2: split training set into k (k=4) eqaul subsets: 3 subsets for traing + 1 subset for validation.\n",
    "* step 3: training your models with the 3 subsets and calculate a performance score with the remaining 1 subset.\n",
    "* step 4: choose a different subset for validation and then repeat step 3 until every subset has been used as a validation subset.\n",
    "* step 5: for a k=4 fold cross validation, each trained model should have been validated by 4 subsets and therefore has 4 performance scores. Calculate the average of these 4 perfermance scores for each model. Use the average score to select the best, final model.\n",
    "* step 6: apply your final model to the **untouched** test data and see how it performs.\n",
    "\n",
    "## Example of k-folds cross validation\n",
    "\n",
    "### Build parameter grids\n",
    "\n",
    "* parameter grid: a combination of all variable parameters in your model.\n",
    "* example: If I want to train a logistic regression model on 4 different *regParam* and 3 different *elasticNetParam*, I will have 3 x 4 = 12 models to train and validate.\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.classification import LogisticRegression\n",
    "blor = LogisticRegression(featuresCol='indexed_features', labelCol='label', family='binomial')\n",
    "\n",
    "from pyspark.ml.tuning import ParamGridBuilder\n",
    "param_grid = ParamGridBuilder().\\\n",
    "    addGrid(blor.regParam, [0, 0.5, 1, 2]).\\\n",
    "    addGrid(blor.elasticNetParam, [0, 0.5, 1]).\\\n",
    "    build()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split data into training and test sets\n",
    "* Refer to the [logistic regression page](logistic-regression.ipynb) to see what data we used and how the training and test sets were generated.\n",
    "\n",
    "### Run k (k=4) folds cross validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n",
    "evaluator = BinaryClassificationEvaluator()\n",
    "\n",
    "from pyspark.ml.tuning import CrossValidator\n",
    "cv = CrossValidator(estimator=blor, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=4)\n",
    "\n",
    "cvModel = cv.fit(training)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
